Introduction

As part of the HLG-MOS challenge, we decided to use the toy data set satgpa. This data set contains 1000 observations and 6 variables:

  • sex: Gender of the student.
  • sat_v: Verbal SAT percentile.
  • sat_m: Math SAT percentile.
  • sat_sum: Total of verbal and math SAT percentiles.
  • hs_gpa: High school grade point average.
  • fy_gpa: First year (college) grade point average.

The variable sat_sum is not considered in the analysis because it is fully determined by sat_sum = sat_v + sat_m. We used the R package synthpop to generate the synthetic data. We then compared the synthetic data to the original data to measure how much information is lost in the process. Finally, we tried to measure the privacy offered by the synthetic data set.

Synthetic data generation

Using the R package synthpop, the synthetic data was produced with the function syn(), and the summaries below were produced with the summary() method from the same package.

# Drop column 4 (sat_sum), which is fully determined by sat_v + sat_m
synth_data <- syn(data[, c(1, 2, 3, 5, 6)], minnumlevels = 2)

Original data

##       sex            sat_v           sat_m          hs_gpa          fy_gpa     
##  Min.   :1.000   Min.   :24.00   Min.   :29.0   Min.   :1.800   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:43.00   1st Qu.:49.0   1st Qu.:2.800   1st Qu.:1.980  
##  Median :1.000   Median :49.00   Median :55.0   Median :3.200   Median :2.465  
##  Mean   :1.484   Mean   :48.93   Mean   :54.4   Mean   :3.198   Mean   :2.468  
##  3rd Qu.:2.000   3rd Qu.:54.00   3rd Qu.:60.0   3rd Qu.:3.700   3rd Qu.:3.020  
##  Max.   :2.000   Max.   :76.00   Max.   :77.0   Max.   :4.500   Max.   :4.000

Synthetic data

##       sex            sat_v           sat_m           hs_gpa         fy_gpa     
##  Min.   :1.000   Min.   :24.00   Min.   :29.00   Min.   :2.00   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:43.00   1st Qu.:49.00   1st Qu.:2.75   1st Qu.:1.995  
##  Median :1.000   Median :49.00   Median :55.00   Median :3.20   Median :2.480  
##  Mean   :1.488   Mean   :49.17   Mean   :54.48   Mean   :3.18   Mean   :2.470  
##  3rd Qu.:2.000   3rd Qu.:55.00   3rd Qu.:60.00   3rd Qu.:3.70   3rd Qu.:3.022  
##  Max.   :2.000   Max.   :76.00   Max.   :77.00   Max.   :4.50   Max.   :4.000
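
The side-by-side summaries above can also be produced in Python with pandas. Below is a minimal sketch; the two small randomly generated frames stand in for the original and synthetic satgpa data (which this report loads from CSV), so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in data playing the role of the original and synthetic frames
orig = pd.DataFrame({"hs_gpa": rng.uniform(1.8, 4.5, 1000),
                     "fy_gpa": rng.uniform(0.0, 4.0, 1000)})
synth = pd.DataFrame({"hs_gpa": rng.uniform(1.8, 4.5, 1000),
                      "fy_gpa": rng.uniform(0.0, 4.0, 1000)})

# Compare the summary statistics side by side, as summary() does in R
comparison = pd.concat({"original": orig.describe(),
                        "synthetic": synth.describe()}, axis=1)
print(comparison.round(3))

# Flag any variable whose means differ by more than 0.1
mean_gap = (orig.mean() - synth.mean()).abs()
print(mean_gap[mean_gap > 0.1])
```

The 0.1 cut-off is arbitrary and only for illustration; an acceptable gap depends on the intended use of the data.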

Synthetic data analysis

In this section we compare the original data to the synthetic version.

Visual Inspection

import pandas as pd
import seaborn as sns
from scipy.stats import wasserstein_distance
import matplotlib.pyplot as plt

sns.set_theme(style="ticks")

# Load the original and synthetic CSV files into data frames
df = pd.read_csv('./data/satgpa.csv')
df = df.drop(columns=['sat_sum'])
dfsynth = pd.read_csv('./synthpop/synth_data.csv')

Original data

sns.pairplot(df, hue="sex")
# Uncomment to render when running in RStudio:
# plt.show()

Synthetic data

sns.pairplot(dfsynth, hue="sex")
# Uncomment to render when running in RStudio:
# plt.show()

Heat maps (correlation matrices)

These plots show the pairwise correlation between variables. This type of validation is most informative when the data are approximately Gaussian (i.e., normally distributed); for skewed data it is not a sufficient test on its own. It is also inherently bivariate: it only looks at the correlation between two variables at a time, so it is appropriate when the intended use of the synthetic data involves calculations based on means and/or pairwise covariances. If the intended use is to study relationships among more than two variables at once (e.g., sex, sat_v, and fy_gpa together), this test will not capture that information.

These heat maps can be used as a preliminary check of how similar the synthetic data is to the original data, assuming that the data are approximately normally distributed, by comparing the pairwise correlations in the two data sets. The heat map makes it easy to see at a glance whether the two data sets produce similar results.

To examine relationships between multiple variables at a time, we can look at the Earth Mover’s Distance (Wasserstein) which will be discussed in the next section.
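
As a sketch of what these heat maps compute, the correlation matrices can be built with pandas and drawn with seaborn. Stand-in data is generated below so the snippet runs on its own; in the report, df and dfsynth are the frames loaded earlier.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)

# Stand-ins for the original and synthetic frames used elsewhere in this report
cols = ["sat_v", "sat_m", "hs_gpa", "fy_gpa"]
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=cols)
dfsynth = pd.DataFrame(rng.normal(size=(1000, 4)), columns=cols)

# Pairwise (Pearson) correlation matrices
corr_orig = df.corr()
corr_synth = dfsynth.corr()

# Draw the two heat maps side by side on a shared colour scale
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(corr_orig, vmin=-1, vmax=1, annot=True, ax=axes[0])
sns.heatmap(corr_synth, vmin=-1, vmax=1, annot=True, ax=axes[1])

# A single number summarising how far apart the two matrices are
max_gap = (corr_orig - corr_synth).abs().to_numpy().max()
print(f"largest correlation difference: {max_gap:.3f}")
```

Reporting the largest absolute difference between the two matrices is one simple way to turn the visual comparison into a number.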

Original data

Synthetic data

Note: synthpop draws random values when generating synthetic data; therefore, repeated runs will not produce identical numbers. However, each synthetic output should yield similar results, so a single comparison is enough to demonstrate that the synthetic data shows correlations similar to the original data.

Earth Mover’s Distance

More information re: Earth Mover’s Distance can be found here.

The EMD is also known as the Wasserstein metric. It was computed between the marginal distributions of the original and synthetic variables using the scipy package.

Variable EMD
sex 0.004
sat_v 0.247
sat_m 0.266
hs_gpa 0.022
fy_gpa 0.018
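
A minimal sketch of this computation, applying scipy's wasserstein_distance to each column in turn. Stand-in data is generated here so the snippet runs on its own; in the report, df and dfsynth are the frames loaded earlier.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)

# Stand-ins for the original and synthetic frames loaded earlier in the report
cols = ["sat_v", "sat_m", "hs_gpa", "fy_gpa"]
df = pd.DataFrame(rng.normal(50, 8, size=(1000, 4)), columns=cols)
dfsynth = pd.DataFrame(rng.normal(50, 8, size=(1000, 4)), columns=cols)

# One EMD per variable, between the two one-dimensional (marginal) samples
emd = {c: wasserstein_distance(df[c], dfsynth[c]) for c in cols}
for c, d in emd.items():
    print(f"{c}: {d:.3f}")
```

An EMD of 0 means the two marginal samples are identical as distributions; larger values mean more probability mass has to be moved to turn one into the other.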

We also measured the EMD between the marginal distributions conditional on the sex variable.

Variable EMD (sex = 1) EMD (sex = 2)
sat_v 0.236 0.513
sat_m 0.428 0.522
hs_gpa 0.028 0.028
fy_gpa 0.028 0.037
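
The conditional version only adds a filter on sex before computing each marginal EMD. A sketch, again with small stand-in frames so it runs on its own (sex is coded 1/2 as in satgpa):

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)

# Stand-in frames with a sex column coded 1/2, as in satgpa
n = 1000
df = pd.DataFrame({"sex": rng.integers(1, 3, n),
                   "fy_gpa": rng.uniform(0, 4, n)})
dfsynth = pd.DataFrame({"sex": rng.integers(1, 3, n),
                        "fy_gpa": rng.uniform(0, 4, n)})

# EMD between the marginals of fy_gpa, conditional on sex
for s in (1, 2):
    d = wasserstein_distance(df.loc[df["sex"] == s, "fy_gpa"],
                             dfsynth.loc[dfsynth["sex"] == s, "fy_gpa"])
    print(f"EMD(fy_gpa | sex={s}) = {d:.3f}")
```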

Excerpts from the emdist documentation

Value

Earth Mover’s Distance (EMD) between the distributions A and B. If A and B are not distributions then A is the source and B is the target.

Details

emd2d interprets the two matrices A and B as distributions over a two-dimensional grid. The distance between the grid points in each direction is defined by xdist and ydist. Both matrices must have the same dimensionality.

emd uses the first column of each matrix as the weights and the remaining columns as location coordinates in an up-to-four-dimensional space. A and B must have the same number of columns. emdw separates the weights from the location matrices but is otherwise identical to emd. emdr uses the original EMD implementation by Yossi Rubner from Stanford.

In case the two matrices A and B are not densities, the weighted sum of flows is normalized by the smaller total mass of the two. The version of the emd package released on CRAN contains only this implementation; all other functions are just front-ends for the call to emdr.

Note re: the distance argument

Distance (dist) to be used for the computation of the cost over the locations. Must be either "euclidean", "manhattan", or a closure taking two vectors and returning a scalar number. The latter case is much less efficient because it requires an R evaluation for every possible combination of flows.


The main difference between emd and emd2d is that emd operates on weighted location vectors whereas emd2d treats every cell of a two-dimensional grid as a point; the latter therefore considers far more location pairs than the former. This would also explain why the emd2d result is “more distant” than the emd result: because emd2d looks at more combinations, it has more to iterate over.
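
To make the multivariate idea concrete, here is a small illustrative sketch in Python (not the emdist package itself). In the special case where both data sets contain the same number of equally weighted points, the EMD reduces to an optimal one-to-one assignment between the two point clouds, which scipy can solve directly:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(4)

# Two clouds of 50 equally weighted points in 4 dimensions, standing in for
# (suitably scaled) original and synthetic records
a = rng.normal(size=(50, 4))
b = rng.normal(size=(50, 4))

# Pairwise Euclidean costs between every original and synthetic point
cost = cdist(a, b)

# With uniform weights and equal sizes, the optimal transport plan is a
# one-to-one matching, so the EMD is the mean cost of the best assignment
rows, cols = linear_sum_assignment(cost)
emd = cost[rows, cols].mean()
print(f"multivariate EMD (equal-weight case): {emd:.3f}")
```

Unlike the per-variable EMDs above, this single number reflects joint structure across all four dimensions at once, which is the property the emd/emd2d functions are after.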

Below are the results and run-time for each EMD function:

Function Result Run-time
emd 1.243486 5.467048 secs
emd2d 2.925377 30.70497 mins

The returned value tells us how close the synthetic data is to the original data. Ideally, you would want a value of 0 (or close to 0). However, since we do not have a pre-determined threshold to gauge proximity, we cannot say whether the returned values are “close enough” to the original data set.

This threshold may differ for users depending on their use case. Nonetheless, this test can show how similar the data are to one another.

Which function should you use? It depends on your use-case.

If the goal is to create synthetic data that is as close as possible to the original data, then it might be useful to use emd2d to get the most accurate picture (of the two functions). Keep in mind that this function takes significantly longer to run than emd. Regardless of which option you choose as your final test, it would be prudent to run emd first and inspect that result before running emd2d: emd is time-effective, and if it returns a high value, emd2d would probably return an even higher one. In other words, if emd already returns a value above your desired threshold, it may be more valuable to adjust the parameters of your synthesis before testing it further.

Privacy offered by the synthetic data

Record-matching as a privacy measure

Trying to match original records in the synthetic data is a simple but practical measure of privacy. If a synthetic data set were published, people would likely check whether they could find themselves in it.

Full matches

Can someone find themselves in the data?

If a record is fully replicated in the synthetic data, it is effectively disclosed.

Thankfully, finding matches is easy to do in R. An inner join of the original and synthetic data will quickly return all rows that match 100%.

# Number of 100% matched rows (requires dplyr, or magrittr, for %>%)
full_matches = merge(satgpa, synth_bisaillon) %>% nrow()

full_matches
## [1] 37
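
The same inner-join check can be sketched in Python with pandas.merge, which also joins on all shared columns by default. The toy frames below stand in for satgpa and the synthetic data:

```python
import pandas as pd

# Toy stand-ins: two of the three "original" rows also appear in "synthetic"
orig = pd.DataFrame({"sex": [1, 2, 1],
                     "hs_gpa": [3.2, 2.8, 4.0],
                     "fy_gpa": [2.5, 2.0, 3.9]})
synth = pd.DataFrame({"sex": [1, 2, 2],
                      "hs_gpa": [3.2, 2.8, 3.0],
                      "fy_gpa": [2.5, 2.0, 1.5]})

# An inner join on all shared columns keeps only fully matching rows,
# just like merge() in R
full_matches = orig.merge(synth)
print(len(full_matches))  # prints 2: two rows match on every column
```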

Partial matches

Could someone recognize parts of their data?

Even if a record is not fully replicated, pieces of it may still be recognizable. For example, someone’s neighbor might be able to make guesses from part of their data, and if even 3 variables of a record are recognizable, the record may still be perceived as disclosed.

In R it is possible to run the merge() function iteratively on all combinations of columns. The combn() function returns all combinations of the elements of a vector.

# Return all possible samples of size 2 from (a, b, c)
combn(c("a", "b", "c"), 2, simplify = FALSE)
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "a" "c"
## 
## [[3]]
## [1] "b" "c"

Partial matches are more likely than full matches. The SAT data set has a variable for sex, so it is nearly guaranteed that at least one column can be matched. As we try to match more and more columns, we get fewer matches.
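
In Python, the same iteration over column subsets can be sketched with itertools.combinations and pandas.merge, reusing the toy stand-in frames from the full-match example:

```python
from itertools import combinations

import pandas as pd

orig = pd.DataFrame({"sex": [1, 2, 1],
                     "hs_gpa": [3.2, 2.8, 4.0],
                     "fy_gpa": [2.5, 2.0, 3.9]})
synth = pd.DataFrame({"sex": [1, 2, 2],
                      "hs_gpa": [3.2, 2.8, 3.0],
                      "fy_gpa": [2.5, 2.0, 1.5]})

# For every pair of columns, count how many original rows also occur
# (on those columns only) somewhere in the synthetic data;
# drop_duplicates() keeps a synthetic value combination from being
# counted more than once per original row
for cols in combinations(orig.columns, 2):
    matched = orig[list(cols)].merge(synth[list(cols)].drop_duplicates())
    print(cols, len(matched))
```

The same loop generalises to subsets of any size by changing the second argument of combinations().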